Abstract. In addition to the frequency of terms in a document collection, the
distribution of terms plays an important role in determining the relevance of documents.
In this paper, a new approach for representing term positions in documents is presented.
The approach allows an efficient evaluation of term-positional information at query
evaluation time. Three applications are investigated: a function-based ranking
optimization representing a user-defined document region, a query expansion technique
based on overlapping the term distributions in the top-ranked documents, and cluster
analysis of terms in documents. Experimental results demonstrate the effectiveness of
the proposed approach.

1 Introduction 

The information retrieval (IR) process has two main stages. The first stage is the index- 
ing stage in which the documents of a collection are processed to generate a database 
(index) containing the information about the terms of all documents in the collection. 
The index generally stores only term frequency information, but in some cases posi- 
tional information of terms is also included, substantially increasing the memory re- 
quirements of the system. 

In the second stage of the IR process (query evaluation), the user sends a query to the 
system, and the system responds with a ranked list of relevant documents. The imple- 
mented retrieval model determines how the relevant documents are calculated. Standard 
IR models (e.g. TFIDF, BM25) use the frequency of terms as the main document rele- 
vance criterion, producing adequate quality in the ranking and query processing time. 
Other approaches, such as proximity queries or passage retrieval, complement the doc- 
ument relevance evaluation using term positional information. This additional process, 
normally performed at query time, generally improves the quality of the results but also 
slows down the response time of the system. Since the response time is a critical issue 
for the acceptance of an IR system by its users, the use of time-consuming algorithms 
to evaluate term-positional information at query time is generally inappropriate. 

The IR model proposed in this paper shifts the complexity of processing the positional
data to the indexing phase, using an abstract representation of the term positions and
implementing a simple mathematical tool to operate with this compressed representation
at query evaluation time. Thus, although query processing remains simple, the use of
term-positional information provides new ways to optimize the IR process.
Three applications are investigated: a function-based ranking optimization representing 
a user-defined document region, a query-expansion technique based on overlapping the 
term distributions in the top-ranked documents, and cluster analysis of terms in docu- 
ments. Experimental results demonstrate the effectiveness of the proposed approach for 
optimizing the retrieval process. 

The paper is organized as follows. Section 2 discusses related work. Section 3 
presents the proposed approach for representing term positions based on truncated 
Hilbert space expansions. In Section 4, applications of the approach are described. Sec- 
tion 5 concludes the paper and outlines areas for future work. 

2 Related Work 

An early approach to apply term-positional data in IR is the work of Attar and Fraenkel 
[2]. The authors propose different models to generate clusters of terms related to a query 
(searchonyms) and use these clusters in a local feedback process. In their experiments 
they confirm that metrical methods based on functions of the distance between terms 
are superior to methods based merely on weighted co-occurrences of terms. There are 
several other approaches that use metrical information [3, 7]. 

One of the first approaches using abstract representations of term distributions in 
documents is Fourier Domain Scoring (FDS), proposed by Park et al. [6]. FDS per- 
forms a separate magnitude and phase analysis of term position signals to produce an 
optimized ranking. It creates an index based on page segmentation, storing term fre- 
quency and approximated positions in the document. FDS processes the indexed data 
using the Discrete Fourier Transform to perform the corresponding spectral analysis. 

A recent approach based on an abstract representation of term position is Fourier 
Vector Scoring (FVS) [4]. It represents the term information (Fourier coefficients) di- 
rectly as an n-dimensional vector using the analytic Fourier transform, permitting an 
immediate and simple term comparison process. 

3 Analyzing Term Positions 

In this section, a general mathematical model to analyze term positions in documents is 
presented, making it possible to effectively use the term-positional information at query 
evaluation time. 

Consider a document D of length L and a term t that appears in D. The distribution
of the term t within the document is given by the set $V_t$ that contains all positions of t,
where all terms are enumerated starting with 1 for the first term and so on. For example,
a set $V_t = \{2, 6\}$ represents a term that is located at the second and sixth position of the
document body. A characteristic function

$$f^{(t)}(x) = \begin{cases} 1 & \text{if } x \in [p-1, p] \text{ for some } p \in V_t, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

defined for $x \in [0, L]$, is assigned to $V_t$.
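As a minimal sketch of this definition, the following Python function builds the characteristic function from a set of 1-based term positions. It assumes that position p occupies the unit interval (p-1, p] of the document; the function name and this boundary convention are illustrative choices, not part of the original text.

```python
import math

def characteristic_function(positions, length):
    """Return f(x) for a term with the given 1-based positions.

    Assumption: position p covers the unit interval (p-1, p] of [0, length].
    """
    occupied = set(positions)

    def f(x):
        if not 0 <= x <= length:
            raise ValueError("x must lie in [0, length]")
        p = max(1, math.ceil(x))  # x = 0 is assigned to the first position
        return 1.0 if p in occupied else 0.0

    return f

f = characteristic_function({2, 6}, length=10)
print([f(x) for x in (0.5, 1.5, 5.5, 6.5)])  # [0.0, 1.0, 1.0, 0.0]
```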

The proposed method consists of approximating this characteristic function by an 
expansion in terms of certain sets of functions. In order to do so, some concepts of 
functional analysis are introduced. Details can be found in the book of Yosida [9]. 



3.1 Expansions in Hilbert Spaces 

A Hilbert space $\mathcal{H}$ is a (possibly infinite-dimensional) vector space that is equipped
with a scalar product $\langle \cdot, \cdot \rangle$, i.e. two elements $f, g \in \mathcal{H}$ are mapped to a real or complex
number $\langle f, g \rangle$. We only consider real scalar products here.

An example of a Hilbert space is the space $L^2([0, L])$ defined as the set of all
functions $f$ that are square-integrable in the interval $[0, L]$, i.e. functions for which
$\int_0^L (f(x))^2 \, dx < \infty$. In this vector space, the addition of two functions $f$ and $g$,
and the multiplication of a function $f$ by a scalar $\alpha \in \mathbb{R}$ are defined point-wise:
$(f + g)(x) = f(x) + g(x)$, $(\alpha f)(x) = \alpha f(x)$. The scalar product in $L^2([0, L])$ is
defined by

$$\langle f, g \rangle = \int_0^L f(x)\, g(x)\, dx. \tag{2}$$

Two vectors with vanishing scalar product are called orthogonal.
The scalar product induces a norm (an abstract measure of length)

$$\|f\| = \sqrt{\langle f, f \rangle} \ge 0. \tag{3}$$

With the help of this norm, the notion of convergence in $\mathcal{H}$ can be defined: A sequence
$f_0, f_1, \ldots$ of vectors of $\mathcal{H}$ is said to converge to a vector $f$, symbolically
$\lim_{n \to \infty} f_n = f$, if $\lim_{n \to \infty} \|f_n - f\| = 0$. This allows one to define an expansion of a
vector $f$ in terms of a set of vectors $\{\varphi_0, \varphi_1, \ldots\}$. One writes

$$f = \sum_{k=0}^{\infty} \gamma_k \varphi_k, \tag{4}$$

where the $\gamma_k$ are real numbers, if the sequence $f_n = \sum_{k=0}^{n} \gamma_k \varphi_k$ of finite sums
converges to $f$. This kind of convergence is called norm convergence.

Of particular importance are so-called complete, orthonormal sets $\{\varphi_0, \varphi_1, \ldots\}$ of
functions in $\mathcal{H}$. They have the following properties: (a) The $\varphi_i$ are mutually orthogonal
and normalized to unity:

$$\langle \varphi_n, \varphi_m \rangle = \delta_{nm} = \begin{cases} 1 & \text{for } n = m, \\ 0 & \text{for } n \ne m. \end{cases} \tag{5}$$

(b) The $\varphi_i$ are complete, which means that every vector of the Hilbert space can be
expanded into a convergent sum of them.

Important properties of expansions in terms of complete orthonormal sets are: (a) The
expansion coefficients $\gamma_k$ are given by

$$\gamma_k = \langle \varphi_k, f \rangle. \tag{6}$$

(b) They fulfill

$$\sum_{k=0}^{n} \gamma_k^2 \le \|f\|^2 \ \text{for all } n, \quad \text{and} \quad \sum_{k=0}^{\infty} \gamma_k^2 = \|f\|^2 \tag{7}$$

(Bessel's inequality and Parseval's equation).

Given two expansions $f = \sum_{k=0}^{\infty} \gamma_k \varphi_k$, $g = \sum_{k=0}^{\infty} \gamma'_k \varphi_k$, the scalar product can
be expressed as

$$\langle f, g \rangle = \sum_{k=0}^{\infty} \gamma_k \gamma'_k. \tag{8}$$

If the expansion coefficients are combined into coefficient vectors $c = (\gamma_0, \gamma_1, \ldots)$,
$c' = (\gamma'_0, \gamma'_1, \ldots)$, the preceding equation takes the form $\langle f, g \rangle = c \cdot c'$.

The Fourier expansions considered by Galeas et al. [4] are an example of such an
expansion. The functions

$$\varphi^{Fo}_0(x) = \frac{1}{\sqrt{L}}, \quad \varphi^{Fo}_{2k-1}(x) = \sqrt{\frac{2}{L}} \sin\left(\frac{2\pi k x}{L}\right), \quad \varphi^{Fo}_{2k}(x) = \sqrt{\frac{2}{L}} \cos\left(\frac{2\pi k x}{L}\right) \tag{9}$$

($k > 0$) form a complete orthonormal set in $L^2([0, L])$, leading to an expansion

$$f(x) = \frac{a_0}{\sqrt{L}} + \sqrt{\frac{2}{L}} \sum_{k=1}^{\infty} \left[ a_k \cos\left(\frac{2\pi k x}{L}\right) + b_k \sin\left(\frac{2\pi k x}{L}\right) \right], \tag{10}$$

where $a_0 = \gamma_0$ and $a_k = \gamma_{2k}$, $b_k = \gamma_{2k-1}$ for $k > 0$.
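To make the Fourier case concrete, the sketch below computes the coefficients $\gamma_0, \ldots, \gamma_n$ of a term distribution in closed form. It assumes that each position p contributes the unit interval [p-1, p], so that every coefficient reduces to an elementary integral of a sine or cosine; the function name `fourier_coefficients` is hypothetical. The final line checks Bessel's inequality (7), using the fact that the squared norm of the distribution equals the number of positions.

```python
import math

def fourier_coefficients(positions, length, order):
    """Return [gamma_0, ..., gamma_order] for the Fourier set on [0, length].

    Assumption: each 1-based position p contributes the interval [p-1, p].
    """
    L = float(length)
    coeffs = [len(positions) / math.sqrt(L)]   # gamma_0: integral of 1/sqrt(L)
    for j in range(1, order + 1):
        k = (j + 1) // 2                       # harmonic: j = 2k-1 or j = 2k
        w = 2.0 * math.pi * k / L
        scale = math.sqrt(2.0 / L) / w
        if j % 2 == 1:                         # sine function phi_{2k-1}
            total = sum(math.cos(w * (p - 1)) - math.cos(w * p) for p in positions)
        else:                                  # cosine function phi_{2k}
            total = sum(math.sin(w * p) - math.sin(w * (p - 1)) for p in positions)
        coeffs.append(scale * total)
    return coeffs

c = fourier_coefficients({3, 8}, length=10, order=6)
print(len(c), sum(g * g for g in c) <= 2.0)  # 7 True
```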

Another complete set of orthonormal functions of $L^2([0, L])$ is given by

$$\varphi^{Le}_k(x) = \sqrt{\frac{2k+1}{L}}\, P^*_k(x/L), \quad k \ge 0, \tag{11}$$

where the $P^*_k(x)$ are so-called shifted Legendre polynomials [1]. These polynomials are
of order $k$. The first few of them are $P^*_0(x) = 1$, $P^*_1(x) = 2x - 1$, $P^*_2(x) = 6x^2 - 6x + 1$,
$P^*_3(x) = 20x^3 - 30x^2 + 12x - 1$. Fig. 1 (left) shows $\varphi^{Le}_k(x)$ for $0 \le k \le 4$ in the range
$x \in [0, L]$ for $L = 1$.

Another example that will be used later is a complete set for the space $L^2(\mathbb{R}^+)$ (the
space of square-integrable functions for $0 \le x < \infty$):

$$\varphi^{La}_k(x) = \frac{e^{-x/(2\lambda)}}{\sqrt{\lambda}}\, L_k(x/\lambda), \quad k \ge 0. \tag{12}$$

Here, $\lambda$ is a positive scale parameter and the $L_k(x)$ are Laguerre polynomials [1], the
first few of which are $L_0(x) = 1$, $L_1(x) = -x + 1$, $L_2(x) = x^2/2 - 2x + 1$,
$L_3(x) = -x^3/6 + 3x^2/2 - 3x + 1$, see Fig. 1 (right).



3.2 Truncated Expansions of Term Distributions 

As explained above, the finite sums $f_n = \sum_{k=0}^{n} \gamma_k \varphi_k$ converge to the function $f$ in the
sense of norm convergence. As a consequence of Bessel's inequality (7) they approximate
$f$ increasingly better for increasing $n$. An essential ingredient for the following
discussion is to consider a truncated expansion, i.e. the mapping

$$P_n : f^{(t)} \mapsto f^{(t)}_n = \sum_{k=0}^{n} \gamma_k \varphi_k, \tag{13}$$

which associates to a term distribution $f^{(t)}$ of the form (1) its finite-order approximation
$f^{(t)}_n$ in terms of some complete orthonormal set for some order $n$.

Fig. 1. Left: Shifted Legendre polynomials $\varphi^{Le}_k(x)$ for $0 \le k \le 4$. Right: The expansion
functions (12) for $0 \le k \le 4$ and $\lambda = 1$.

Fig. 2. Fourier distribution of $V_t = \{3, 8\}$ in document D, using different Fourier orders n.

Figure 2 shows an example for the Fourier expansion. One can observe the charac- 
teristic broadening effect generated by the reduction of the expansion order (truncation). 

The $L^2$ scalar product of two truncated term distributions $f_n$ and $g_n$,

$$\langle f_n, g_n \rangle = \int f_n(x)\, g_n(x)\, dx, \tag{14}$$

has the meaning of an overlap integral: The integrand is only large in regions in which
both functions $f_n(x)$ and $g_n(x)$ are large, so that $\langle f_n, g_n \rangle$ measures how well both
functions overlap in the whole integration range.

Given $f_n$ and $g_n$, two truncated term distributions describing the term positions
and their neighborhood in a certain document, we introduce the concept of semantic
interaction range: Two terms that are close to each other present a stronger interaction
because their truncated distributions have a considerable overlap. This semantic
interaction range motivates the following definition of the similarity of two term
distributions $f$ and $g$: For some fixed order $n$, one sets

$$\mathrm{sim}(f, g) = \langle f_n, g_n \rangle = \langle P_n f, P_n g \rangle. \tag{15}$$

In this definition, the truncation $P_n : f \mapsto f_n$ is essential, because the original term
distributions $f$ and $g$ are always orthogonal if they describe two different terms. This is
so because different terms are always at different positions within a document, so that
their overlap always vanishes.

Definition (15) is only one possibility. In fact, any definition based on the scalar
product $\langle f_n, g_n \rangle$ can be utilized. For example, in Galeas et al. [4] a cosine definition
$\cos \vartheta = \langle f_n, g_n \rangle / (\|f_n\| \, \|g_n\|)$ has been used. Another choice is the norm difference

$$\|f_n - g_n\| = \left( \int (f_n(x) - g_n(x))^2 \, dx \right)^{1/2} = \sqrt{\|f_n\|^2 + \|g_n\|^2 - 2 \langle f_n, g_n \rangle}. \tag{16}$$

Using different measures based on $\langle f_n, g_n \rangle$, we have found no significant differences
in the final retrieval results in several experiments.

The scalar product of the truncated distributions can be easily calculated using the
coefficient vectors: If the original distributions $f$ and $g$ have the infinite-dimensional
coefficient vectors $c = (\gamma_0, \gamma_1, \ldots)$ and $c' = (\gamma'_0, \gamma'_1, \ldots)$, respectively, then the
truncated distributions $f_n$ and $g_n$ have the $(n+1)$-dimensional coefficient vectors
$c_n = (\gamma_0, \gamma_1, \ldots, \gamma_n)$ and $c'_n = (\gamma'_0, \gamma'_1, \ldots, \gamma'_n)$, respectively, and their scalar product
is the finite sum

$$\langle f_n, g_n \rangle = c_n \cdot c'_n = \sum_{k=0}^{n} \gamma_k \gamma'_k. \tag{17}$$
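The three measures discussed above (scalar product, cosine, and norm difference) can all be evaluated from the truncated coefficient vectors alone. The following sketch illustrates this for the Fourier set; it assumes unit-interval term positions as before, and the helper names are hypothetical. The example compares a term at position 20 with a nearby term (position 25) and a distant one (position 100) in a document of length 200.

```python
import math

def fourier_coefficients(positions, length, order):
    """Truncated Fourier coefficient vector; position p covers [p-1, p] (assumed)."""
    L = float(length)
    coeffs = [len(positions) / math.sqrt(L)]
    for j in range(1, order + 1):
        w = 2.0 * math.pi * ((j + 1) // 2) / L
        if j % 2 == 1:   # sine function
            total = sum(math.cos(w * (p - 1)) - math.cos(w * p) for p in positions)
        else:            # cosine function
            total = sum(math.sin(w * p) - math.sin(w * (p - 1)) for p in positions)
        coeffs.append(math.sqrt(2.0 / L) * total / w)
    return coeffs

def dot(c, d):
    """Scalar product of two truncated distributions, cf. (17)."""
    return sum(a * b for a, b in zip(c, d))

def cosine(c, d):
    return dot(c, d) / math.sqrt(dot(c, c) * dot(d, d))

def norm_difference(c, d):
    """Norm difference expressed through scalar products, cf. (16)."""
    return math.sqrt(max(0.0, dot(c, c) + dot(d, d) - 2.0 * dot(c, d)))

L, n = 200, 6
query = fourier_coefficients({20}, L, n)
near = fourier_coefficients({25}, L, n)
far = fourier_coefficients({100}, L, n)
print(dot(query, near) > dot(query, far))  # True
```

In this example all three measures agree on the ordering: the nearby term has the larger scalar product and cosine, and the smaller norm difference.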



3.3 The Semantic Interaction Range 

In this section, a precise definition of the semantic interaction range is given. 

In abstract terms, the truncation $P_n : f \mapsto f_n$ is a filtering or a projection: In the
expansion $f(x) = \sum_{k=0}^{\infty} \gamma_k \varphi_k(x)$ the components $\varphi_k$ for $k > n$ are filtered out, which
amounts to a projection of $f$ onto the components $\varphi_0, \ldots, \varphi_n$. Thus, $P_n$ is a projection
operator in the Hilbert space. To derive a closed expression for the operator $P_n$, one
combines $(P_n f)(x) = f_n(x) = \sum_{k=0}^{n} \gamma_k \varphi_k(x)$ with (6) to obtain

$$(P_n f)(x) = \sum_{k=0}^{n} \left( \int \varphi_k(y) f(y)\, dy \right) \varphi_k(x) = \int \left( \sum_{k=0}^{n} \varphi_k(y) \varphi_k(x) \right) f(y)\, dy. \tag{18}$$

One can write the last expression as $\int p_n(y, x) f(y)\, dy$ with the projection kernel

$$p_n(y, x) = \sum_{k=0}^{n} \varphi_k(y) \varphi_k(x) \tag{19}$$

as an integral representation of $P_n$ in the sense of a convolution. It has the advantage
that one can study the properties of the truncation independently of the function $f$.

Fig. 3. Left: Projection kernel for the Fourier expansion showing the semantic interaction range
for two terms at the positions 20 and 100, for $n = 6$ and $L = 200$. Right: Projection kernel for
the expansion in terms of Laguerre polynomials showing the semantic interaction range for two
terms at the positions 20 and 100, for $n = 6$ and $\lambda = 15$.

The width of $p_n(y, x)$ as a function of $x$ is a lower bound for the width of a truncated
expansion of a term located at $y$. Therefore, this width will be used as the semantic
interaction range for a term at position $y$.
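The projection kernel can be evaluated directly from its defining sum (19). The sketch below does this for the Fourier functions (9), without relying on any closed form; the function names are hypothetical. For an even order n, the sine and cosine functions pair up so that the kernel depends only on the difference y - x, which the example checks numerically.

```python
import math

def fourier_basis(k, x, length):
    """Value of the k-th Fourier function (9) at x."""
    L = float(length)
    if k == 0:
        return 1.0 / math.sqrt(L)
    arg = 2.0 * math.pi * ((k + 1) // 2) * x / L
    trig = math.sin(arg) if k % 2 == 1 else math.cos(arg)
    return math.sqrt(2.0 / L) * trig

def projection_kernel(y, x, order, length):
    """p_n(y, x) as the finite sum (19) over the first order+1 functions."""
    return sum(
        fourier_basis(k, y, length) * fourier_basis(k, x, length)
        for k in range(order + 1)
    )

L, n = 200, 6
peak = projection_kernel(20, 20, n, L)
print(round(peak, 6))  # 0.035
# Same offset, different term position: the values agree for even n.
print(abs(projection_kernel(20, 30, n, L) - projection_kernel(100, 110, n, L)) < 1e-12)  # True
```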

For the Fourier expansion, $p_{2k}$ is given by

$$p^{Fo}_{2k}(y, x) = \frac{\cos(4\pi k (y - x)/L) - \cos(2\pi (2k+1)(y - x)/L)}{L \, (1 - \cos(2\pi (y - x)/L))}. \tag{20}$$

(We consider only even orders $n = 2k$, because for these orders the expansion consists
of an equal number of sine and cosine terms, see (9).) The maximum of $p_{2k}(y, x)$ is at
$x = y$ and the two zeros closest to the maximum are at $x = y \pm L/(2n + 1)$. Thus, the
semantic interaction range for a Fourier expansion of order $n$ may be defined to be



^=2^1- (21) 

Fig. 3 (left) shows $p^{Fo}_6(20, x)$ and $p^{Fo}_6(100, x)$ for $L = 200$.

For the expansions in terms of Legendre and Laguerre polynomials, the projection
kernels can be calculated with the Christoffel-Darboux equation [1]. The results are

$$p^{i}_n(y, x) = a^{i}_n \, \frac{\varphi^{i}_{n+1}(y)\, \varphi^{i}_n(x) - \varphi^{i}_n(y)\, \varphi^{i}_{n+1}(x)}{y - x}, \tag{22}$$

$i = Le, La$, with $a^{Le}_n = (L/2)(n + 1)/(2n + 1)$ and $a^{La}_n = -\lambda (n + 1)$. These kernels
are no longer functions of $y - x$, meaning that the broadening of a term distribution
depends on the position $y$ of the term distribution within the document.

Fig. 3 (right) shows the projection kernel $p^{La}_6(y, x)$ for $y = 20$ and $y = 100$. One
can see that the spatial resolution of the truncated expansion decreases for terms that
are far away from the beginning of the document.
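The position dependence of the Laguerre kernel can be illustrated by evaluating the defining sum (19) with the functions (12). The sketch below generates the Laguerre polynomials with the standard three-term recurrence $(k+1) L_{k+1}(u) = (2k+1-u) L_k(u) - k L_{k-1}(u)$; the function names are hypothetical. As a rough proxy for resolution, it compares the peak height $p_n(y, y)$ at y = 20 and y = 100 for $n = 6$ and $\lambda = 15$: a lower peak suggests a broader kernel.

```python
import math

def laguerre_values(u, order):
    """Return [L_0(u), ..., L_order(u)] via the three-term recurrence."""
    values = [1.0, 1.0 - u]
    for k in range(1, order):
        values.append(((2 * k + 1 - u) * values[k] - k * values[k - 1]) / (k + 1))
    return values[: order + 1]

def laguerre_basis(x, order, lam):
    """Values of the functions (12) at x for k = 0..order and scale lam."""
    scale = math.exp(-x / (2.0 * lam)) / math.sqrt(lam)
    return [scale * v for v in laguerre_values(x / lam, order)]

def laguerre_kernel(y, x, order, lam):
    """Projection kernel as the finite sum (19)."""
    return sum(a * b for a, b in zip(laguerre_basis(y, order, lam),
                                     laguerre_basis(x, order, lam)))

n, lam = 6, 15.0
print(laguerre_kernel(20, 20, n, lam) > laguerre_kernel(100, 100, n, lam))  # True
```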
4 Applications



The goal of our approach is to shift the complexity of processing the positional data 
from the query evaluation phase to the (not time critical) indexing phase, reducing the 
ranking optimization via term positions to a simple mathematical operation. 

Hence, we propose to calculate the expansion coefficients $\gamma_k$ of the term distributions
in the indexing phase and to store this abstract term positional information in the
index. This permits a considerably faster query evaluation, compared with methods that
use the raw term-positional information.

Thus, the index contains an $(n+1)$-dimensional coefficient vector $c_n = (\gamma_0, \gamma_1, \ldots, \gamma_n)$
for each term and each document in the collection. The $\gamma_k$ are calculated analytically
via (6). To give an example of the complexity involved,

$$\gamma_k = \sum_{p \in V_t} \sum_{j=0}^{k} \alpha_j \left[ \left( \frac{p}{L} \right)^{j+1} - \left( \frac{p-1}{L} \right)^{j+1} \right] \tag{23}$$

with $\alpha_j = \sqrt{(2k+1)L}\; a_j/(j+1)$ is the expression for the expansion coefficients in
the case of the expansion in terms of Legendre polynomials, cf. (1). (The $a_j$ are the
polynomial coefficients of the shifted Legendre polynomial of order $k$.) Calculations of
this kind can be easily performed in the indexing stage.
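A possible indexing-time computation of the Legendre coefficients is sketched below. It assumes that each position p covers the unit interval [p-1, p] and uses the standard explicit formula $a_j = (-1)^{k+j} \binom{k}{j} \binom{k+j}{j}$ for the coefficients of the shifted Legendre polynomial of order k; the function names are hypothetical.

```python
import math

def shifted_legendre_coefficients(k):
    """Polynomial coefficients a_0..a_k of the shifted Legendre polynomial P*_k."""
    return [(-1) ** (k + j) * math.comb(k, j) * math.comb(k + j, j)
            for j in range(k + 1)]

def legendre_coefficient(positions, length, k):
    """Expansion coefficient gamma_k obtained by integrating each monomial
    of the normalized polynomial over the intervals [p-1, p]."""
    L = float(length)
    a = shifted_legendre_coefficients(k)
    alpha = [math.sqrt((2 * k + 1) * L) * a[j] / (j + 1) for j in range(k + 1)]
    return sum(
        alpha[j] * ((p / L) ** (j + 1) - ((p - 1) / L) ** (j + 1))
        for p in positions
        for j in range(k + 1)
    )

def legendre_vector(positions, length, order):
    """Coefficient vector c_n = (gamma_0, ..., gamma_n) stored in the index."""
    return [legendre_coefficient(positions, length, k) for k in range(order + 1)]

print(shifted_legendre_coefficients(3))  # [-1, 12, -30, 20]
```

For a term present at every position the distribution is constant, so only $\gamma_0$ should be non-zero; this is a convenient sanity check for an implementation.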

The retrieval scenarios that we have investigated are: (a) ranking optimization based
on user-defined objective functions, (b) query expansion based on term-positional
information [4], and (c) cluster analysis of terms in documents. They all involve a
calculation of the similarity of term distributions.



4.1 Ranking Optimization 

The first scenario states document ranking as an optimization problem that is based
on the query term distribution function $f_{q,d}$ and a user-defined objective function $f_o$
representing the optimal query term distribution in the document body:

$$\text{Maximize } \{\mathrm{sim}(f_{q,d}, f_o)\} \quad \forall f_{q,d} \in A, \tag{24}$$

where $A$ represents the query term distributions in a document set, $f_{q,d}$ is the query
term distribution function for query q in document d, and $f_o$ is a user-defined objective
function, representing the optimal query term distributions for the documents in
the document ranking. Experiments based on the TREC-8 collection and the software
Terrier [5], carried out to order $n = 6$, show the accuracy of the term distributions in a
ranking based on user-defined objective functions. As depicted in Figure 4, the Fourier
and Legendre models present a high accuracy for the distribution of query terms in the
top-20 ranked documents, based on two different objective functions: The first function
(denoted $f_o = 1|3$) selects terms located in the first third of the document, and the
second ($f_o = 3|3$) selects terms located in the last third of the document [4].
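The ranking scenario can be sketched as follows. This is an illustration under several assumptions that are not taken from the experiments above: document lengths are normalized to [0, 1], the Fourier set is used, the objective function is the indicator of an interval such as the first third, and the cosine measure is used so that documents with different numbers of query-term occurrences remain comparable. All names are hypothetical.

```python
import math

def interval_coefficients(intervals, order):
    """Fourier coefficient vector on [0, 1] for a union of intervals (a, b)."""
    coeffs = [sum(b - a for a, b in intervals)]      # phi_0 = 1 on [0, 1]
    for j in range(1, order + 1):
        w = 2.0 * math.pi * ((j + 1) // 2)
        if j % 2 == 1:   # sine function
            total = sum(math.cos(w * a) - math.cos(w * b) for a, b in intervals)
        else:            # cosine function
            total = sum(math.sin(w * b) - math.sin(w * a) for a, b in intervals)
        coeffs.append(math.sqrt(2.0) * total / w)
    return coeffs

def cosine(c, d):
    dot = sum(a * b for a, b in zip(c, d))
    return dot / math.sqrt(sum(a * a for a in c) * sum(b * b for b in d))

def rank_documents(documents, objective, order=6):
    """documents: {doc_id: (query_term_positions, doc_length)};
    objective: interval (a, b) of the normalized document body."""
    target = interval_coefficients([objective], order)
    scores = {}
    for doc_id, (positions, length) in documents.items():
        intervals = [((p - 1) / length, p / length) for p in positions]
        scores[doc_id] = cosine(interval_coefficients(intervals, order), target)
    return sorted(scores, key=scores.get, reverse=True)

docs = {"early": ({4, 6}, 30), "late": ({25, 27}, 30)}
print(rank_documents(docs, objective=(0.0, 1 / 3)))  # ['early', 'late']
print(rank_documents(docs, objective=(2 / 3, 1.0)))  # ['late', 'early']
```

Note that the Fourier functions are periodic, so in this sketch the very end of a document is treated as adjacent to its beginning.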
Fig. 4. Objective function performance for the Fourier, Legendre and Laguerre models. The x
axis shows the TREC topics 400 to 450, the y axis is the term position relative to the normalized 
document length. The circle and the rectangle bounds represent the range of the query term 
positions for the objective functions 1|3 and 3|3 respectively. 



4.2 Query Expansion 

The second scenario considers the top-r documents $D = \{d_1, d_2, \ldots, d_r\}$ of an initial
ranking process and the functions $f_{q,d}$ with $d \in D$. The set of terms $T_q$ whose elements
$t$ maximize the expression $\mathrm{sim}(f_{q,d}, f_{t,d})$ is computed. It contains the terms for all
documents in $D$ that have a similar distribution as the query, i.e. terms positioned near
the query in the top ranked documents. This set $T_q$ is used to expand $q$.
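A minimal sketch of this selection step is given below. It makes simplifying assumptions that go beyond the description above: the query consists of a single term, the Fourier set of order 6 is used, and the similarity scores of a candidate term are summed over the top-ranked documents before the best candidates are kept. The data layout and all names are hypothetical.

```python
import math

def fourier_coefficients(positions, length, order):
    """Truncated Fourier coefficient vector; position p covers [p-1, p] (assumed)."""
    L = float(length)
    coeffs = [len(positions) / math.sqrt(L)]
    for j in range(1, order + 1):
        w = 2.0 * math.pi * ((j + 1) // 2) / L
        if j % 2 == 1:   # sine function
            total = sum(math.cos(w * (p - 1)) - math.cos(w * p) for p in positions)
        else:            # cosine function
            total = sum(math.sin(w * p) - math.sin(w * (p - 1)) for p in positions)
        coeffs.append(math.sqrt(2.0 / L) * total / w)
    return coeffs

def expansion_terms(top_documents, query_term, order=6, how_many=2):
    """top_documents: list of ({term: positions}, doc_length) pairs.
    Returns the candidate terms with the highest summed similarity."""
    scores = {}
    for term_positions, length in top_documents:
        if query_term not in term_positions:
            continue
        q = fourier_coefficients(term_positions[query_term], length, order)
        for term, positions in term_positions.items():
            if term == query_term:
                continue
            t = fourier_coefficients(positions, length, order)
            sim = sum(a * b for a, b in zip(q, t))   # scalar product of truncations
            scores[term] = scores.get(term, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:how_many]

top_docs = [
    ({"hilbert": {10, 50}, "space": {11, 51}, "ranking": {90}}, 100),
    ({"hilbert": {5}, "space": {6}, "index": {70}}, 80),
]
print(expansion_terms(top_docs, "hilbert", how_many=1))  # ['space']
```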

As depicted in Figure 5, experiments executed on the TREC-8 collection demonstrate
that query expansion based on the proposed orthogonal functions (Fourier and
Laguerre) outperforms state-of-the-art query expansion models, such as Rocchio and
Kullback-Leibler [5]. The term position models (left) differ from the other models
(right) because the former tend to increase the retrieval performance by increasing the
number of expansion documents and expansion terms, while for the other models, the
performance drops beyond roughly the 15th expansion document.

Figure 6 (left) shows a fixed query expansion configuration in which the other mod- 
els show their best performance. Nevertheless, the term distribution models perform 
better. Any increase in the number of expansion documents or expansion terms makes 
the superiority of the term distribution models even clearer. 



4.3 Cluster Analysis of Terms in Documents 



Given a document, one may ask whether there are groups (clusters) of terms whose
elements all have similar distributions. One may then infer that all terms inside a cluster
describe related concepts [2]. In this section, some properties of the proposed method
will be explained that may be useful for the analysis of term clusters.

Fig. 5. Precision at 10 documents for term positional models and two other models, using different
query expansion configurations. The axes labeled documents and terms correspond to $|D|$ and
$|T_q|$, respectively.

Consider a document of length $L$. Since at every position within the document a
particular term may either be present or not, there are in total $N = 2^L$ possible term
distributions. Each of these distributions is mapped to a point in an $(n+1)$-dimensional
Hilbert space. If the norm difference (16) is used as the similarity criterion, then clusters
of similar term distributions are just Euclidean point clusters in the Hilbert space.

We will now investigate the geometrical structure of the set of all possible term
distributions. Let us first calculate the center $\bar{f}(x) = (1/N) \sum_{\nu=1}^{N} f^{(\nu)}(x)$ of all term
distributions (here $f^{(\nu)}(x)$, $\nu = 1, \ldots, N$, is an enumeration of distributions of the form
(1)). At any position $x$, half of all $N$ distributions have a term present ($f^{(\nu)}(x) = 1$) and
the other half does not ($f^{(\nu)}(x) = 0$), so that $\bar{f}(x) = 1/2 = \text{const}$ for all $x \in [0, L]$.
This average distribution is mapped to a non-truncated, in general infinite-dimensional
coefficient vector $\bar{c}$, whose length $|\bar{c}|$ is given by the norm
$\|\bar{f}\| = [\int_0^L dx/4]^{1/2} = \sqrt{L}/2$. The squared distance between the center point and the
coefficient vector of a distribution $f^{(\nu)}$ is
$|\bar{c} - c^{(\nu)}|^2 = \|\bar{f} - f^{(\nu)}\|^2 = \int_0^L (1/2 - f^{(\nu)}(x))^2 \, dx$. Since $f^{(\nu)}(x)$
is either 0 or 1, it follows that $(1/2 - f^{(\nu)}(x))^2 = 1/4 = \text{const}$ for all $x \in [0, L]$, giving
$|c^{(\nu)} - \bar{c}| = \sqrt{L}/2$ for all $\nu$. This means that the non-truncated coefficient vectors of
all term distributions lie on the surface of a sphere with radius $\sqrt{L}/2$ whose center is at
$\bar{c}$. Because $|\bar{c}| = \sqrt{L}/2$, this sphere touches the origin of the Hilbert space.

Bessel's inequality (7) leads to $|c^{(\nu)}_n - \bar{c}_n| \le \sqrt{L}/2$ for all $\nu$ for the coefficient
vectors truncated to order $n$. Thus, the truncated vectors all lie within a sphere of radius

$$R_0 = \sqrt{L}/2 \tag{25}$$

in the $(n+1)$-dimensional Hilbert space. The center of this sphere is at $\bar{c}_n$. If, as in the
Fourier and Legendre cases, one of the expansion functions, say $\varphi_0(x)$, is constant, the
vector $\bar{c}$, which itself describes a constant function, has only a non-vanishing zeroth
component: $\bar{c} = \bar{c}_n = (\sqrt{L}/2, 0, 0, \ldots)$. Fig. 6 (right) shows this term sphere in $n + 1 = 3$
dimensions for a document of length $L = 9$ and the expansion in terms of Legendre
polynomials.

Fig. 6. Left: Query expansion performance for the term distribution models (Fourier, Legendre
and Laguerre) and the other models, using a configuration of 15 expanded documents and 40
expanded terms. Right: Three-dimensional sphere of all 512 possible term distributions in a
document of length $L = 9$ for the expansion in terms of Legendre polynomials.
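The sphere property can be checked numerically for the small example of Fig. 6 (right). The sketch below enumerates all $2^9 = 512$ term distributions of a document of length 9, maps each to its Legendre coefficient vector of order 2, and measures the distance to the center $(\sqrt{L}/2, 0, 0)$. It assumes unit-interval term positions and the standard explicit coefficients of the shifted Legendre polynomials; the helper names are hypothetical.

```python
import itertools
import math

def shifted_legendre_coefficients(k):
    """Polynomial coefficients a_0..a_k of the shifted Legendre polynomial P*_k."""
    return [(-1) ** (k + j) * math.comb(k, j) * math.comb(k + j, j)
            for j in range(k + 1)]

def legendre_vector(positions, length, order):
    """Truncated Legendre coefficient vector; position p covers [p-1, p] (assumed)."""
    L = float(length)
    vector = []
    for k in range(order + 1):
        a = shifted_legendre_coefficients(k)
        gamma = sum(
            math.sqrt((2 * k + 1) * L) * a[j] / (j + 1)
            * ((p / L) ** (j + 1) - ((p - 1) / L) ** (j + 1))
            for p in positions
            for j in range(k + 1)
        )
        vector.append(gamma)
    return vector

L, n = 9, 2
radius = math.sqrt(L) / 2.0
center = [radius] + [0.0] * n
distances = []
for mask in itertools.product((0, 1), repeat=L):
    positions = [p + 1 for p, present in enumerate(mask) if present]
    distances.append(math.dist(legendre_vector(positions, L, n), center))

print(len(distances), max(distances) <= radius + 1e-9)  # 512 True
```

The empty and the full distribution are mapped to the origin and to $(\sqrt{L}, 0, 0)$, respectively, so they lie exactly on the surface of the sphere.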

The fact that all possible truncated coefficient vectors $c^{(\nu)}_n$ lie within a sphere whose
radius and center are known is very useful for clustering analysis. First of all, it shows
where in the Hilbert space to look for clusters. Secondly, assume one has found a cluster
$K = \{k_1, \ldots, k_q\}$ of term distributions by some clustering algorithm (for an $n$th order
truncation). The volume of this cluster can be estimated by calculating the standard
deviation $R_K = [(1/q) \sum_{i=1}^{q} (k_i - \bar{k})^2]^{1/2} = [(1/(2q^2)) \sum_{i,j=1}^{q} (k_i - k_j)^2]^{1/2}$ (here
$\bar{k}$ is the center of the cluster) and approximating the cluster by a sphere of radius $R_K$.
Since the volume of a sphere of radius $R_K$ in $n+1$ dimensions is proportional to $R_K^{n+1}$,
the cluster occupies approximately a part $\xi = (R_K/R_0)^{n+1} = (2R_K/\sqrt{L})^{n+1}$ of the
theoretically available space. A cluster would then be considered as significant only if
$\xi \ll 1$. An analysis of this kind may be useful to generate an ontology of terms based
on individual documents.
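The cluster estimate can be sketched as follows. The code computes the cluster radius both from the cluster center and from the pairwise distances, and then the occupied fraction of the available space for a document of length L. The sample points are made-up coefficient vectors used purely for illustration, and the function names are hypothetical.

```python
import math

def cluster_radius(points):
    """Standard deviation of the cluster members around their center."""
    q = len(points)
    dim = len(points[0])
    center = [sum(p[i] for p in points) / q for i in range(dim)]
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / q)

def cluster_radius_pairwise(points):
    """The same quantity expressed through all pairwise distances."""
    q = len(points)
    total = sum(math.dist(a, b) ** 2 for a in points for b in points)
    return math.sqrt(total / (2.0 * q * q))

def occupied_fraction(points, length):
    """Fraction of the term sphere (radius sqrt(L)/2) taken up by the cluster."""
    r0 = math.sqrt(length) / 2.0
    dim = len(points[0])              # n + 1 dimensions
    return (cluster_radius(points) / r0) ** dim

# Hypothetical truncated coefficient vectors (n + 1 = 3) of three terms.
cluster = [(1.0, 0.1, 0.0), (1.1, 0.0, 0.1), (0.9, 0.1, 0.1)]
xi = occupied_fraction(cluster, length=9)
print(xi < 1.0)  # True
```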



It has been conjectured that the use of quantum mechanical methods, in particular
infinite-dimensional Hilbert spaces and projection operators, may be advantageous in
IR [8]. The approach presented here goes in this direction, because constructing
appropriate sets of orthogonal functions is a standard technique in quantum mechanics.
Still, we emphasize that our approach is essentially classical, not quantum mechanical,
since it does not use any of the interpretational subtleties of quantum mechanics.

5 Conclusions 

In this paper, a new approach to improve document relevance evaluation using trun- 
cated Hilbert space expansions has been presented. The proposed approach is based on 
an abstract representation of term positions in a document collection which induces a 
measure of proximity between terms (semantic interaction range) and permits their di- 
rect and simple comparison. Based on this abstract representation, it is possible to shift 
the complexity of processing term-positional data to the indexing phase, permitting 
the use of term-positional information at query time without significantly affecting the 
response time of the system. Three applications for IR were discussed: (a) ranking opti- 
mization based on a user-defined term distribution function, (b) query expansion based 
on term-positional information, and (c) a cluster analysis approach for terms within 
documents. 

There are several areas of future work. Topics that should be investigated include
(a) quantifying the effect of the abstract term position representation on the index
size, (b) measuring the effectiveness of the proposed clustering approach, and
(c) studying objective functions in documents having homogeneous structures (forms).